Install the required packages:
The paper (Markus F. et al.) uses two element ratios as proxies for aridity: Ti/Al and Ti/Ca. The first pair is used for the last 200 kyr and the second pair for the last 1200 kyr. The data come from the KL15 core. A proposed alternative is the use of K as a proxy for aridity.
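As a quick illustration (with hypothetical count values, not the KL15 data), these ratio proxies are simple element-wise quotients:

```r
# Hypothetical XRF counts; column names follow the *_Area naming scheme used below
xrf <- data.frame(
  Ti_Area = c(1212, 2882, 4988),
  Al_Area = c(540, 1354, 2311),
  Ca_Area = c(122343, 254441, 368604)
)
xrf$ti_al <- xrf$Ti_Area / xrf$Al_Area  # pair used for the last 200 kyr
xrf$ti_ca <- xrf$Ti_Area / xrf$Ca_Area  # pair used for the last 1200 kyr
round(xrf$ti_al, 2)  # 2.24 2.13 2.16
```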
with literature (from Croudace and Rothwell, chapter 2).
Get the data objects from targets:
data <- tar_read(data_kl15)
data_qf <- tar_read(missings_depth)
data_comp <- tar_read(data_kl15_comp)
data_clr <- tar_read(data_kl15_comp_clr) %>% as.data.frame()
data_ilr <- tar_read(data_kl15_comp_ilr) %>% as.data.frame()
data_alr <- tar_read(data_kl15_comp_alr) %>% as.data.frame()
variables <- colnames(data_comp)
This is the data from the Ocean Drilling Program close to Cyprus. The data contains several cores, an age model and XRF data. The XRF data is a proxy for the chemical composition of the sediment.
The analysis is done with Ti/Al as a proxy for wetness (?).
In the XRF data we have the following variables:
colnames(data)
## [1] "depth" "age" "Br_Area" "Rb_Area" "Sr_Area" "Zr_Area"
## [7] "Ru_Area" "Mg_Area" "Al_Area" "Si_Area" "S_Area" "K_Area"
## [13] "Ca_Area" "Ti_Area" "Fe_Area" "aggregate"
Variable selection is done in the target for data_kl15. Nevertheless, we need a list of element names for labeling.
# data_select$aggregate <- rowSums(data_select[3:ncol(data_select)])
summary(data)
## depth age Br_Area Rb_Area
## Min. : 1.0 Min. : 2.017 Min. : 256 Min. : 377.0
## 1st Qu.: 540.5 1st Qu.:111.830 1st Qu.:1115 1st Qu.: 735.0
## Median :1085.0 Median :262.289 Median :1381 Median : 815.0
## Mean :1083.9 Mean :265.790 Mean :1399 Mean : 823.2
## 3rd Qu.:1625.5 3rd Qu.:432.656 3rd Qu.:1668 3rd Qu.: 906.0
## Max. :2167.0 Max. :546.127 Max. :2970 Max. :1295.0
## Sr_Area Zr_Area Ru_Area Mg_Area Al_Area
## Min. :15372 Min. : 812 Min. : 9804 Min. : 42.0 Min. : 540
## 1st Qu.:23760 1st Qu.:2213 1st Qu.:13018 1st Qu.:108.0 1st Qu.:1142
## Median :25812 Median :2612 Median :13431 Median :126.0 Median :1354
## Mean :25838 Mean :2619 Mean :13499 Mean :126.6 Mean :1358
## 3rd Qu.:27922 3rd Qu.:2994 3rd Qu.:13958 3rd Qu.:146.0 3rd Qu.:1569
## Max. :34438 Max. :5051 Max. :16097 Max. :217.0 Max. :2311
## Si_Area S_Area K_Area Ca_Area Ti_Area
## Min. : 5190 Min. : 512 Min. : 2887 Min. :122343 Min. :1212
## 1st Qu.:11444 1st Qu.:1649 1st Qu.: 6978 1st Qu.:230544 1st Qu.:2489
## Median :13558 Median :1996 Median : 7808 Median :254441 Median :2882
## Mean :13541 Mean :1960 Mean : 7786 Mean :256036 Mean :2897
## 3rd Qu.:15560 3rd Qu.:2300 3rd Qu.: 8621 3rd Qu.:281045 3rd Qu.:3260
## Max. :22160 Max. :8278 Max. :11273 Max. :368604 Max. :4988
## Fe_Area aggregate
## Min. : 9929 Min. :192496
## 1st Qu.:18891 1st Qu.:326004
## Median :21831 Median :350588
## Mean :21853 Mean :349736
## 3rd Qu.:24332 3rd Qu.:375557
## Max. :37020 Max. :452012
Let’s do some boxplots for our variables:
par(mfrow = c(1,2))
boxplot(data[, variables], main="Boxplots of the Element counts", names = variables, las = 2)
boxplot(data[, variables[variables != "Ca_Area"]], main="Boxplots of the Element counts", names = variables[variables != "Ca_Area"], las = 2)
par(mfrow = c(1,1))
Ca plays a dominant role. Mg and Rb don’t show relevant variation (at first glance) and can probably be dropped in later steps of the analysis (better argumentation is needed though). The variance of Sr, Si and Fe is also more dominant than that of the other elements (is this reflected in the simplex as well?).
We want to analyse the element counts together with a time component, which gives us a Compositional Time Series (CTS).
The compositional data does not sum up to a constant value; in fact, the spread of the sum is rather large. We want to know how the sum of the compositional parts behaves over time:
ggplot(data, aes(x = age, y = aggregate)) +
geom_line() +
geom_smooth(method = "loess") +
labs(x = "Time BP", y = "Sum", title = "Trend of the compositional sum over time") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
It looks like it’s getting more compact with time until there is a turning point (which might make sense from a geological perspective?). Anyway, the non-constant sum makes sense, since we probably don’t measure all components of the sediment.
Due to the high variation of the sum, we could apply the concept of sparsely sampled data (cf. Greven and Steyer, 2023). The idea is that we assume we only draw a sample from the whole composition.
Just for fun and orientation, let’s have a look at the distribution of elements at a certain time point:
row <- data[1, 3:(ncol(data)-1)]
# Convert the row to a data frame and reshape it to a long format
df <- as.data.frame(t(row))
df$Element <- rownames(df)
colnames(df)[1] <- "Value"
df$Value <- as.numeric(df$Value)
# Create the bar plot
ggplot(df, aes(x = Element, y = Value)) +
geom_bar(stat = "identity", fill = "blue") +
labs(x = "Element", y = "Value", title = "Values of Elements for 1 row") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
In addition, let’s plot important compositional parts (not in the Simplex though!) over time:
# note: par(mfrow) has no effect on ggplot2 plots; the four plots below are drawn separately
# plot data with age and Ca_Area
ggplot(data, aes(x = age, y = Ca_Area)) +
geom_line() + geom_smooth(method = "loess") +
labs(x = "Time BP", y = "Ca", title = "Count values for Ca over time") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(data, aes(x = age, y = Fe_Area)) +
geom_line() +
geom_smooth(method = "loess") +
labs(x = "Time BP", y = "Fe", title = "Count values for Fe over time") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(data, aes(x = age, y = Sr_Area)) +
geom_line() +
geom_smooth(method = "loess") +
labs(x = "Time BP", y = "Sr", title = "Count values for Sr over time") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(data, aes(x = age, y = K_Area)) +
geom_line() +
geom_smooth(method = "loess") +
labs(x = "Time BP", y = "K", title = "Count values for K over time") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
There are a few hints of extraordinary events that might have a strong influence on the linear optimization problem: at ~280 BP (and 410 BP) there is a crash in Ca and Sr. Compare: “Element ratios found useful as proxies by Rothwell et al. (2006) are: • Sr/Ca (indicating high-Sr aragonite requiring a shallow-water source)” => Actually, the Sr/Ca ratio does not change at all at these events. But the ratio of Sr and Ca to all others might change. When we take the compositional sum into account, it looks like these events are characterized by low aggregated counts due to the crash in Ca and Sr counts, which might be a problem with the cores or a measurement error in an area where the elementary density of the sediment is low (why?). In any case, we should keep in mind that these extreme count variations have a strong influence on the optimization algorithm of PCA. But this effect is much stronger when dealing with absolute counts than with compositional parts. That could be an example to show the effect of spurious correlation and the advantage of CoDa.
data_kl15_agem <- read.table("data/Africa_NE_200/data/data_KL15-2_smooth_spline_ages.txt", header = TRUE, sep = "\t")
summary(data_kl15_agem)
These appear to be confidence intervals for the age data. “accrate” is the accumulation rate, which indicates the speed of sedimentation.
We also need the data with “quality flags”
summary(data_qf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 584 1079 1068 1573 2068
TODO: do we need z-transformations after the projection onto the Simplex? TODO: plots: differentiate cts, alr, clr, ilr.
Data reading functionality is stored in the /scripts directory. However, targets are used for the rendering process of this html.
As an entry point, we have two R packages at hand for compositional data analysis: compositions and robCompositions. The first is more general (and more popular, see [its documentation](https://www.rdocumentation.org/packages/compositions/versions/2.0-8)); the second is more focused on robust statistics, including [“robust” PCA](https://www.rdocumentation.org/packages/robCompositions/versions/2.4.1).
In the following we take a first look at centered log ratios (clr), which are the most recommended technique. However, after some deliberation it seems that isometric log ratios (ilr) are more appropriate for our data and methods, that is, for data centered around one dimension (Ca?) and for PCA.
Let’s do the boxplot with clr coordinates:
par(mfrow = c(3,1))
boxplot(data_clr,
main = "Boxplots of the clr coordinates for each element", las = 2)
# boxplot(data_clr[, variables[variables != "Ca_Area"]], main="Boxplots of the clr coordinates for each element", las = 2)
boxplot(data_ilr,
main = "Boxplots of the ilr coordinates for each element", las = 2)
boxplot(data_alr,
main = "Boxplots of the alr coordinates for each element", las = 2)
par(mfrow = c(1,1))
We can see that the transformed coordinates are much less dispersed than the original count values.
The expected value of a composition can be calculated as follows:
A composition is a vector \(\mathbf{x} = (x_1, \ldots, x_D)\), where \(D\) is the number of components (elements) and \(\sum_{i=1}^D x_i = 1\) (or an arbitrary constant, due to the scale invariance of CoDa). The clr transformation is then defined as \(\text{clr}(\mathbf{x}) = \left(\ln\frac{x_1}{g(\mathbf{x})}, \ldots, \ln\frac{x_D}{g(\mathbf{x})}\right)\), where \(g(\mathbf{x}) = \left(\prod_{i=1}^D x_i\right)^{1/D}\) is the geometric mean of the components.
For the clr-transformed data, the expected value of each component is
\(\mathbb{E}[\text{clr}(\mathbf{X})_i] = \mathbb{E}\left[\ln\frac{X_i}{g(\mathbf{X})}\right]\)
where \(\mathbf{X}\) is a random variable taking values in the simplex and \(\mathbb{E}\) denotes the expected value.
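A minimal base-R sketch of the clr transformation defined above (`clr_base` is our own helper, not a package function):

```r
# clr: log of each part relative to the geometric mean of all parts
clr_base <- function(x) log(x / exp(mean(log(x))))

comp <- c(0.2, 0.3, 0.5)
sum(clr_base(comp))   # clr coordinates always sum to zero
clr_base(100 * comp)  # scale invariance: rescaling leaves the coordinates unchanged
```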
We use it without further ado, since this is the most robust transformation for compositional data.
# What happens with the relative information if we constrain each row to sum up to 1?
data_comp_1 <- constSum(data_comp, const = 1)
# all(rowSums(data_comp_1) == 1)
# data_comp_1$aggregate <- rowSums(data_comp_1)
ratio_ti_al <- data_comp$Ti_Area / data_comp$Al_Area
head(ratio_ti_al)
## [1] 3.203147 4.090747 3.507299 2.848214 2.804781 2.367790
ratio_ti_al_1 <- data_comp_1$Ti_Area / data_comp_1$Al_Area
head(ratio_ti_al_1)
## [1] 3.203147 4.090747 3.507299 2.848214 2.804781 2.367790
# Bang!
That looks nice, but is the relative information (i.e. the ratio of one mineral to the others) more important than the absolute value (i.e. the amount of Ca in the sediment)?
By the way, the transformation above illustrates that relative information is scale invariant while absolute information is not.
We can calculate the Aitchison distance between all elements for each point of time:
aDist(data_comp[1:7, ])
## 2 3 4 5 6 7
## 3 0.4595200
## 4 0.3798355 0.2553081
## 5 0.4565667 0.8014176 0.7326206
## 6 0.5914449 0.9314431 0.8203699 0.2296415
## 7 0.7604190 1.0833814 1.0008768 0.4475818 0.4355657
## 8 0.4894409 0.8388048 0.7627073 0.3260665 0.3427500 0.6094309
Now we can calculate the clr transformation for the data:
clr_coord <- cenLR(data_comp)$x.clr
ggplot(clr_coord, aes(x = Ti_Area, y = Al_Area)) +
geom_point(size = 2) +
xlab("Ti Area (clr)") +
ylab("Al Area (clr)")
Interpretation, in very imprecise terms: the log of the ratio of equal components is zero, i.e. they are of equal importance. The clr transformation divides each value in a row by the geometric mean of all parts (the D-th root of their product) and takes the natural log. In very unqualified language, that is the relevant information of one part in relation to all the others. If the clr value is positive, the part is more important than the average; if it is negative, it is less important.
So here Al and Ti clearly show a linear trend, i.e. if one is less important then so is the other. Also, the spread is higher for more negative values, which makes sense since their importance goes into oblivion.
# TODO: the dataset needs to be transformed to a pivot table?
res <- hclust(dist(data_comp_1), method = "single")
plot(res)
That’s not what we want, since we calculate the distances between observations. But it is actually really interesting to see. What is wrong with observation 17? Additionally, clusters of observations (that is, times) are relevant as well.
But now, how do we get the dendrogram for the distances between the elements?
So at first sight, that is disappointing. Even regarding relative information Ca clearly dominates the picture.
Also, we should try other clustering methods than single linkage.
res <- hclust(dist(t(data_comp_1)), method = "ward.D2")
plot(res)
Does not change much.
There are ways out of that dilemma. We could use isometric transformations with Ca as pivot!
The steps of the analysis are built on: https://www.geo.fu-berlin.de/en/v/soga-r/Advances-statistics/Multivariate-approaches/Principal-Component-Analysis/PCA-the-basics/Data-preparation/index.html
Something to keep in mind is that PCA is scale sensitive, while CoDa is scale invariant. That is an additional reason for the transformation into clr, alr or ilr coordinates. It also means that any scaling steps are unnecessary in CoDa PCA.
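A small base-R illustration of this point on synthetic data (not the core data): plain PCA reacts strongly to the measurement scale, while clr coordinates are unaffected by rescaling.

```r
set.seed(1)
X <- cbind(a = rnorm(50, mean = 100, sd = 10),
           b = rnorm(50, mean = 1, sd = 0.1))
p_raw <- prcomp(X)               # unscaled: the large-scale variable dominates PC1
abs(p_raw$rotation["a", "PC1"])  # close to 1

# clr coordinates do not change when a composition is rescaled
clr_base <- function(x) log(x / exp(mean(log(x))))
all.equal(clr_base(c(1, 2, 7)), clr_base(10 * c(1, 2, 7)))  # TRUE
```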
With robCompositions:
Aim:
- reduce our 14-D space without losing too much of the relative information, that is, the compositional aspects of the data;
- describe the relation between the observations (in time) and the compositional parts.
In the orthonormal space we calculated independent ilr coordinates (why not clr?).
The general idea of PCA is a reduction in dimensionality. But from the perspective of identifying relevant environmental processes, it might just as well be the removal of non-relevant components.
There are a number of methods to calculate a PCA.
With robCompositions we can use the pcaCoDa function, which uses an eigen-decomposition of the covariance matrix. The composition is represented in orthonormal coordinates (the ilr coordinates), and then standard PCA is applied.
We can calculate the eigenvalues and eigenvectors for CoDa:
pca_cov <- cov(data_comp)
pca_eigen <- eigen(pca_cov)
rownames(pca_eigen$vectors) <- colnames(data_comp)
pca_eigen$values
## [1] 1.377245e+09 2.448005e+07 1.352429e+06 1.135047e+06 5.617667e+05
## [6] 2.839764e+05 1.172108e+05 1.005787e+05 5.654494e+04 2.334446e+04
## [11] 3.776002e+03 2.456838e+03 3.230698e+02
The eigenvalues are an indication of the explained variance of each component and are thus used to order the components. The sum of the eigenvalues is equal to the total variance of the data (cf. chapter 28, SOGA).
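This identity (the eigenvalues sum to the trace of the covariance matrix, i.e. the total variance) is easy to verify in base R on synthetic data:

```r
set.seed(42)
X <- matrix(rnorm(200), ncol = 4)
S <- cov(X)
ev <- eigen(S)$values
sum(ev)        # sum of the eigenvalues ...
sum(diag(S))   # ... equals the trace of the covariance matrix
```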
The eigenvectors of the count data are as follows:
pca_eigen$vectors
## [,1] [,2] [,3] [,4] [,5]
## Br_Area -0.0016513059 0.012094298 0.0904767401 0.051332878 0.0084396852
## Rb_Area -0.0016589730 -0.021268147 -0.0059407932 -0.011993584 -0.0046628877
## Sr_Area 0.0705672419 0.145337466 -0.9287096087 -0.264596474 0.1391734463
## Zr_Area -0.0081573213 -0.088850062 -0.0186943060 0.003622246 0.0501037912
## Ru_Area -0.0036385790 -0.033837574 -0.1349137357 0.098896054 -0.8617760261
## Mg_Area 0.0001532339 -0.003591217 0.0007526123 -0.006008152 0.0009817243
## Al_Area 0.0005499198 -0.058957978 0.0188053529 -0.084199767 -0.0091908226
## Si_Area -0.0003415904 -0.580390635 0.1448738124 -0.712637416 -0.0623198113
## S_Area 0.0018014708 -0.006496839 0.1001807240 0.132316528 0.4634002069
## K_Area -0.0038077042 -0.251666998 -0.0449359449 -0.215627450 0.1076378191
## Ca_Area 0.9958652256 -0.054661533 0.0495307906 0.050360631 -0.0091210912
## Ti_Area -0.0061717508 -0.104524125 -0.0199973826 -0.020762126 0.0213954923
## Fe_Area -0.0559569447 -0.742701291 -0.2729005236 0.579196288 0.0673324943
## [,6] [,7] [,8] [,9] [,10]
## Br_Area 0.677610880 0.551036304 0.279347593 0.378538495 -0.0530405838
## Rb_Area 0.010450384 0.032498262 -0.028665992 0.059197277 -0.0648107012
## Sr_Area 0.126693709 0.008015885 0.066948790 -0.037129820 0.0048693997
## Zr_Area -0.044998877 0.639001351 -0.432277795 -0.564263927 -0.2701922097
## Ru_Area 0.334203342 -0.183805370 -0.272818922 -0.090956507 -0.0015418055
## Mg_Area 0.007195458 -0.001001356 0.004189193 -0.001427534 -0.0083199130
## Al_Area 0.030917652 -0.020836807 0.053623096 -0.017940831 -0.0343112584
## Si_Area 0.131901700 -0.081349746 0.223606517 -0.204773417 0.0088801954
## S_Area 0.619594338 -0.439085818 -0.316465307 -0.281424255 0.0336513587
## K_Area -0.069859923 -0.013953792 -0.650250708 0.635099589 -0.1985293169
## Ca_Area -0.011736497 0.006130943 -0.003203310 0.002241609 0.0002989051
## Ti_Area -0.013295400 0.229664888 -0.232380242 0.009396493 0.9364201756
## Fe_Area -0.058893264 -0.017432446 0.159645903 0.016072337 -0.0346680194
## [,11] [,12] [,13]
## Br_Area 0.044533815 0.001171370 -6.062055e-03
## Rb_Area -0.932006231 0.344829412 4.592670e-02
## Sr_Area 0.002834707 0.002120674 -1.627233e-03
## Zr_Area 0.012390863 -0.019943296 -2.353911e-03
## Ru_Area 0.004250358 -0.001267352 -3.577651e-04
## Mg_Area 0.022676397 -0.072992029 9.969783e-01
## Al_Area -0.344698465 -0.927696845 -6.158593e-02
## Si_Area 0.041524677 0.112559038 -1.329200e-03
## S_Area -0.019850345 0.009789526 -2.295180e-03
## K_Area 0.067690217 -0.035842892 -3.966174e-03
## Ca_Area -0.001060349 0.000208423 7.452767e-05
## Ti_Area -0.055940349 -0.025689797 8.015954e-03
## Fe_Area 0.006284210 -0.005300817 -9.712624e-05
If we compare them with the eigenvectors of CoDa, we see clear differences, and at first glance a linear decomposition is only meaningful in CoDa (e.g. with absolute counts the first component is Ca alone).
Interpretation for values in Euclidean space: We can already infer from the numerical values that the first component is almost exclusively composed of Ca (accounting for roughly 98% of the total variance of the raw counts). Depending on the geological interpretation, it might be useful to treat Ca as a basic component that we are not that interested in and want to take out of the equation to put a focus on the other compositional parts?
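Using the eigenvalues printed above, the dominance of the first component can be quantified directly:

```r
# Eigenvalues of the raw count covariance matrix, copied from the output above
ev <- c(1.377245e9, 2.448005e7, 1.352429e6, 1.135047e6, 5.617667e5,
        2.839764e5, 1.172108e5, 1.005787e5, 5.654494e4, 2.334446e4,
        3.776002e3, 2.456838e3, 3.230698e2)
ev[1] / sum(ev)  # about 0.98: PC1 alone carries ~98% of the count variance
```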
If we compare that with our results from the transformed coordinates, we can clearly see the difference between variation in absolute count numbers vs. variation of transformed coordinates in the simplex, which, in my view, is a strong argument that PCA should only be considered for transformed data. All in all, variations in relative information are very different from absolute variations (also see the boxplots).
We can use ternary diagrams to explore the relative information of three components at a time:
data_3_comp <- data[, c("Br_Area", "Sr_Area", "Fe_Area")]
ternaryDiag(data_3_comp, line = "pca")
There is something off with this diagram. TODO: investigate
The ‘robCompositions’ package gives practical functionality to do PCA, but its inner workings are not very clear (yet).
## compositional PCA, non-robust
p_comp <- pcaCoDa(data[, variables], method = "classical")
## compositional PCA, robust
set.seed(234) # to reproduce the result exactly
p_comp_rob <- pcaCoDa(data[, variables], method = "robust")
summary(p_comp)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.4644809 0.3220378 0.2342279 0.19970525 0.12579737
## Proportion of Variance 0.4776027 0.2295856 0.1214530 0.08828968 0.03503274
## Cumulative Proportion 0.4776027 0.7071882 0.8286412 0.91693092 0.95196366
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.09531499 0.07253847 0.053211408 0.042032413
## Proportion of Variance 0.02011191 0.01164844 0.006268166 0.003911106
## Cumulative Proportion 0.97207557 0.98372401 0.989992176 0.993903282
## Comp.10 Comp.11 Comp.12
## Standard deviation 0.034003301 0.033404581 0.02195261
## Proportion of Variance 0.002559606 0.002470262 0.00106685
## Cumulative Proportion 0.996462888 0.998933150 1.00000000
summary(p_comp_rob)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.3476417 0.2722106 0.1848371 0.14948644 0.1178311
## Proportion of Variance 0.4265377 0.2615192 0.1205791 0.07886735 0.0490020
## Cumulative Proportion 0.4265377 0.6880570 0.8086361 0.88750342 0.9365054
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.08976346 0.06709033 0.04437459 0.035062141 0.030987120
## Proportion of Variance 0.02843759 0.01588596 0.00694964 0.004338809 0.003388879
## Cumulative Proportion 0.96494302 0.98082898 0.98777862 0.992117425 0.995506304
## Comp.11 Comp.12
## Standard deviation 0.029044092 0.020728727
## Proportion of Variance 0.002977209 0.001516488
## Cumulative Proportion 0.998483512 1.000000000
plot(p_comp_rob, type = "l")
We can see that there is no big jump in the explanatory power of the components. For later analysis, we need to qualify the meaningfulness of each component (see the chapter on environmental processes). Also, the Kaiser criterion can be used to determine the number of components.
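A sketch of a Kaiser-type rule on the robust eigenvalues (values rounded from the pcaCoDa output in this document; for a covariance-based PCA the mean eigenvalue is the usual cutoff instead of 1):

```r
# Robust eigenvalues, rounded from the p_comp_rob output
ev <- c(0.1209, 0.0741, 0.0342, 0.0223, 0.0139, 0.0081,
        0.0045, 0.0020, 0.0012, 0.0010, 0.0008, 0.0004)
keep <- which(ev > mean(ev))  # mean-eigenvalue rule (Kaiser-Guttman analogue)
keep                          # components 1, 2 and 3
```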
The ‘robCompositions’ package seems to apply high-level transformations to the input.
str(p_comp_rob)
The PCA is calculated (with ‘compositions’) with a Singular Value Decomposition: \[ \operatorname{clr}\left(\mathbf{X}^*\right)=\mathbf{U} \cdot \mathbf{D} \cdot \mathbf{V}^t \]
The columns of \(V\) are the loadings or principal components of the clr-transformed data points. \(D\) is a diagonal matrix with the “eigenvalues”, whose values can be interpreted as the standard deviations of the coordinates on the new orthonormal basis (\(V\)). \(U\) contains the scores or standardized coordinates, which we can use to derive synthetic data. (Cf. Boogaart et al., page 177)
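A base-R sketch of this decomposition on a toy composition (again using our own clr helper rather than the package functions):

```r
clr_base <- function(x) log(x / exp(mean(log(x))))

set.seed(7)
X <- matrix(runif(30, min = 1, max = 10), ncol = 3)
X <- X / rowSums(X)                  # closure onto the simplex
Z <- t(apply(X, 1, clr_base))        # row-wise clr coordinates
s <- svd(Z)                          # Z = U D V^t
Zr <- s$u %*% diag(s$d) %*% t(s$v)   # exact reconstruction
max(abs(Z - Zr))                     # numerically zero
```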
acomp <- acomp(data_comp)
pca_rs <- princomp(acomp)
# str(pca_rs)
pca_rs$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Br_Area 0.679 0.472 0.272 0.374 0.121
## Rb_Area -0.129 0.143 0.150 -0.172 -0.479 0.737 0.136 -0.134
## Sr_Area 0.159 -0.466 0.264 -0.161 0.124 0.195 -0.115 0.697
## Zr_Area -0.219 0.344 -0.260 -0.298 0.707 0.219 -0.207
## Ru_Area -0.148 0.374 -0.213 -0.108 -0.272 -0.240 -0.507 0.520
## Mg_Area -0.189 -0.174 0.561 -0.720 -0.115
## Al_Area -0.234 -0.194 0.347 0.425 -0.313 0.151 0.246
## Si_Area -0.242 -0.166 0.246 0.355 -0.118
## S_Area 0.438 -0.752 -0.381
## K_Area -0.172 0.146 0.242 -0.113
## Ca_Area 0.146 -0.522 0.135 0.169 0.257 -0.225 -0.645
## Ti_Area -0.209 0.204 -0.106 -0.385 0.646 0.420
## Fe_Area -0.193 0.217 -0.197 -0.298 -0.435 -0.190 -0.664
## Comp.11 Comp.12
## Br_Area
## Rb_Area -0.142
## Sr_Area 0.153
## Zr_Area
## Ru_Area 0.176
## Mg_Area
## Al_Area -0.338 -0.474
## Si_Area 0.181 0.771
## S_Area
## K_Area 0.808 -0.375
## Ca_Area -0.165 -0.108
## Ti_Area -0.283
## Fe_Area -0.188
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.077 0.077 0.077 0.077 0.077 0.077 0.077 0.077 0.077
## Cumulative Var 0.077 0.154 0.231 0.308 0.385 0.462 0.538 0.615 0.692
## Comp.10 Comp.11 Comp.12
## SS loadings 1.000 1.000 1.000
## Proportion Var 0.077 0.077 0.077
## Cumulative Var 0.769 0.846 0.923
pca_rs$sdev
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 0.46459057 0.32211378 0.23428321 0.19975239 0.12582706 0.09533749 0.07255559
## Comp.8 Comp.9 Comp.10 Comp.11 Comp.12
## 0.05322397 0.04204233 0.03401133 0.03341247 0.02195779
Scores: interpretation(?), but they can be used to reconstruct the projected values \(\boldsymbol{x}_{\mathrm{i}}^{\prime}\). Relevant for the time analysis: we can identify how much each observation contributes to a component. If we can actually identify the latent processes, we can observe the time frames in which they are most relevant in the data.
head(p_comp_rob$scores)
head(pca_rs$scores)
Loadings: in Euclidean space, an indication of the contribution of each compositional part to the respective component.
p_comp_rob$loadings
pca_rs$loadings
In the ‘compositions’ output, small loadings (below the print cutoff) are not shown. But the reported proportions of explained variance can’t be right. We can calculate the explained variance for the first three components:
sum(pca_rs$sdev[1:3]^2/mvar(acomp))
## [1] 0.8286412
83% is consistent with the results from ‘robCompositions’.
Eigenvalues: they sum to the total variance and, after normalization, indicate the contribution of each component to it.
p_comp_rob$eigenvalues
## [1] 0.1208547782 0.0740986025 0.0341647596 0.0223461960 0.0138841789
## [6] 0.0080574789 0.0045011125 0.0019691040 0.0012293537 0.0009602016
## [11] 0.0008435593 0.0004296801
Let’s plot the components in relation to the (dominant) first component:
TODO! => check robCompositions to reconstruct the plots below, since the choices argument does not work as intended.
par(mfrow=c(5,1), mar = c(4,4,2,2))
# biplot(p_comp, xlabs = rownames(data))
biplot(p_comp_rob)
# biplot(p_comp, xlabs = rownames(data))
biplot(p_comp_rob, choices = c(2,3))
# biplot(p_comp, xlabs = rownames(data))
biplot(p_comp_rob, choices = c(2,2))
# biplot(p_comp, xlabs = rownames(data))
biplot(p_comp_rob, choices = c(1,5), xlabs = rownames(data))
# biplot(p_comp, xlabs = rownames(data))
biplot(p_comp_rob, choices = c(1,6), xlabs = rownames(data))
par(mfrow=c(1,1))
That also looks like something we can work with. While the first component seems to be associated with Ca (no surprise) and Sr, the second is a combination of Zr, Ti and Fe. Maybe most interestingly, Br seems to take a special place and is probably dominant in the third component.
TODO: Calculate scores for the components and plot them over time. TODO: Validate result and compare it with euclidean distances -> explain the difference!
We can use the loadings to characterize the components.
loadings <- matrix(pca_rs$loadings, nrow = 13, ncol = 12)
rownames(loadings) <- variables
par(mfrow = c(3,2), mar = c(4, 8, 2, 2))
for (i in 1:6) {
df <- as.data.frame(loadings[,i])
df$rowname <- rownames(loadings)
barplot(df[,1],
horiz = TRUE,
names.arg = df$rowname,
las = 1,
col = ifelse(df[,1] > 0, "#b0451a", "#1fd844"),
main = paste("Loadings of PC", i),
xlab = "Loadings")
abline(v = 0, lty = 2)
}
par(mfrow = c(1,1))
This graphic gives a very rough idea of the characteristics of the components. A better method would be to construct prototype compositions.
By the way, at first sight it seems like it is not Ca that dominates the variation in the (transformed) data, but Br. What does Br do? Wikipedia => the high solubility of the bromide ion (Br−) has caused its accumulation in the oceans (that makes sense, but probably does not tell us much about specific climate-relevant processes?).
We can use score plots to further investigate clusters of time-dependent observations on a two-dimensional plane spanned by two components (and, if possible, two environmental processes).
TODO
We can use the principal components to reconstruct the data with all or some of the components. One idea would be to calculate an “average” composition and show the influence of every component on the compositional parts (over time).
(pca_loading <- pca_rs$loadings[, 1:2])
## Comp.1 Comp.2
## Br_Area 0.67945509 0.4721098572
## Rb_Area -0.12941364 0.1432708096
## Sr_Area 0.15903986 -0.4663330513
## Zr_Area -0.21891889 0.3440904060
## Ru_Area 0.06190432 -0.1478030756
## Mg_Area -0.08652991 -0.1889374848
## Al_Area -0.23384207 -0.0396757464
## Si_Area -0.24160180 0.0024134101
## S_Area 0.43842460 -0.0183811318
## K_Area -0.17242527 0.0005571997
## Ca_Area 0.14565293 -0.5221431057
## Ti_Area -0.20922825 0.2042282659
## Fe_Area -0.19251697 0.2166036472
# calculate squared sum of loadings to check if the sum is 1
apply(pca_loading^2, 2, sum)
## Comp.1 Comp.2
## 1 1
Since \(\boldsymbol{x}_{\mathrm{i}}^{\prime}\) is a linear combination of the observations and their loadings, we can use them to reconstruct the original data (plus error).
Remember that we need the transformed variables in the Simplex. To get the variables in Euclidean space (i.e. the original count space), we need to retransform the result.
# one observation (row 2) projected on PC1
# remember: we work with transformed variables in the simplex; to get the variables in Euclidean space (i.e. the original count space) we need to retransform the result
obs1 <- acomp[2, ]
obs1_matrix <- matrix(as.numeric(obs1), nrow = 1)
obs1_matrix %*% pca_loading[, 1]
## [,1]
## [1,] 0.08278673
This value is the projection of our 13-dimensional transformed composition (here the second observation) onto the first component. It is also the score value of this observation on PC1, which is nothing else than a linear transformation of the observation: \[ z_{i 1}=\phi_{11} \mathbf{x}_{i 1}+\phi_{21} \mathbf{x}_{i 2}+\ldots+\phi_{d 1} \mathbf{x}_{i d}, \quad i=1,2, \ldots, n \]
Those linear transformations can be very helpful to analyse compositional variations over time (e.g. which component is more dominant at which time).
Unfortunately, it is not possible to clearly interpret the clr-coordinates (real or synthetic). If we deal with synthetic data, we need to transform them back to euclidean space (which is not possible for ilr? -> check!)
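The score formula above can be checked with base R's prcomp on synthetic data: the stored score equals the dot product of the centred observation with the loading vector.

```r
set.seed(3)
X <- matrix(rnorm(40), ncol = 4)
p <- prcomp(X)  # centres the data by default
z11 <- sum((X[1, ] - colMeans(X)) * p$rotation[, 1])  # phi_11 x_11 + ... + phi_d1 x_1d
all.equal(z11, unname(p$x[1, 1]))  # TRUE: identical to the stored score
```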
# head(pca_rs$scores)
Z <- matrix(as.numeric(acomp), nrow = 2119) %*% pca_rs$loadings
The scores in the pca object are not correct!
data <- tar_read(data_kl15)
agem <- data$age
Z <- as.data.frame(cbind(Z, agem))  # ggplot needs a data frame, not a matrix
ggplot(Z, aes(x = agem, y = Comp.1)) +
geom_line() +
geom_smooth(method = "loess") +
labs(x = "Time BP", y = "Scores PC1", title = "Scores for PC1 over time") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
So that shows us some clear trends. If we had a clear interpretation of the component (e.g. as a proxy for aridity), we could show a transformed picture that reduces the relative information to just this process and eliminates the “noise” of other processes. Additionally, we can show the impact of the component on an “average” composition to construct prototypical compositions of different processes.
The idea is quite similar to PCA with a decomposition of the data into loadings and scores:
\[ \mathbf{X}_{n \times q}=\mathbf{M}_{n \times q} \mathbf{V}_{q \times q}^T \]
Through the separation of principal components and error we can separate signal and noise within the data to identify natural processes:
“In EMMA, in addition to implementing dimensionality reduction through PCA, factor analysis is applied to simplify the structure of the eigenvector matrix, V, by an orthogonal rotation of the axes in the feature space. This rotation removes the order of the eigenvectors and redistributes the loadings more evenly − a condition often used to decipher natural processes (Dietze et al. 2012).”
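The orthogonal rotation described in the quote is available in base R as stats::varimax; a minimal sketch on a random loading matrix (toy data, not the EMMA eigenvectors):

```r
set.seed(9)
L <- matrix(rnorm(12), ncol = 2)  # toy loading matrix: 6 variables, 2 components
vr <- varimax(L)                  # orthogonal rotation maximising the variance of squared loadings
Lr <- L %*% vr$rotmat             # rotated loadings
crossprod(vr$rotmat)              # identity matrix: the rotation is orthogonal
```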
EMMA makes use of an iterative algorithm to find the best rotation of the axes in the feature space. The application of EMMA within the compositional framework offers some interesting perspectives.
But since one model is just a random selection from the sample space, we need a large set of models to get reliable results with EMMA (which we don’t have? => We can treat each time-dependent observation as a sample -> that’s interesting and probably allows for time-dependent components).
We work with synthetic data resulting from 4 natural processes (r = 4) that are randomly resampled.
We have 45 observations of 20 grain size classes in our sample (from a population of 99 classes). Each class value indicates the volume, and the volumes of all classes add up to a constant.
Since it is simulated data, we know that each natural process has a specific profile (that’s what I called a prototype earlier).
By the way, since we have a continuous space for grain sizes, what we are observing are densities, and the data can be used to replicate the study from Prof. Greven.
For our data, we have the disadvantage that elements are not ordered. Therefore the identification of profiles is more complicated, but the main idea remains the same. The idea we are simulating is that each natural process has an ideal profile (here the four different densities; with discrete data it would be specific compositions). The data, consisting of our 2117 observations, is a mixture of the natural processes (and the error component -> which is only contained in the removed components => realistic?). If we use this concept, we need to introduce a time component, since the processes we are interested in change dynamically over time.
load(url("http://www.userpage.fu-berlin.de/~soga/300/30100_data_sets/data_emma.RData"))
str(df.soil.samples[1:20, 1:20])
# first sample (observation)
phi <- as.numeric(colnames(df.soil.samples)) # grain size classes stored as column names
plot(phi, df.soil.samples[1, ], type = "l", ylab = "Volume proportion")
# Step 1
X <- as.matrix(df.soil.samples) # convert data.frame object to matrix object
c <- 1 # set constant sum, can be 1, or 100 (%), or any other meaningful constant (note: this masks base::c() below)
X <- X / apply(X, 1, sum) * c
# Step 2
lw <- 0.05 # set percentile
qts <- function(X, lw) quantile(X, c(lw, 1 - lw), type = 5) # define function
ls <- t(apply(X, 2, qts, lw = lw)) # apply function column-wise
W <- t((t(X) - ls[, 1]) / (ls[, 2] - ls[, 1])) # calculate weight transformation
# Step 3
## Eigen space extraction
A <- t(W) %*% W # calculate minor product
EIG <- eigen(A) # calculate eigenvectors and eigenvalues
Vf <- EIG$vectors # eigenvector matrix (eigen() already returns decreasing eigenvalue order)
L <- EIG$values # eigenvalues
Lv <- cumsum(L / sum(L)) # cumulative share of normalized eigenvalues
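A common first guide for choosing the number of end members is the cumulative share of explained variance; the eigenvalue spectrum and the 95% threshold below are made-up illustrations, not results from our data:

```r
# Hypothetical normalized eigenvalue spectrum (a stand-in for the Lv computed above)
Lv_demo <- cumsum(c(0.55, 0.25, 0.10, 0.06, 0.03, 0.01))
# smallest number of end members explaining at least 95% of the variance
r_candidate <- which(Lv_demo >= 0.95)[1]
r_candidate # 4
```

This only gives a candidate; an iterative search over r (refitting and scoring each model) would still be needed for a defensible choice.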
# Step 4
r <- 4
# since factor analysis is deterministic regarding the number of components, we either need a very good argument or an iterative process to find optimal r
Vr <- varimax(Vf[, 1:r]) # apply varimax rotation
# Step 5
library(limSolve) # provides nnls() for non-negative least squares
# calculate the end-member loadings
Vq <- Vr$loadings[, order(seq(r, 1, -1))] # extract and sort factor loadings
Vqr <- t(t(Vq) / apply(Vq, 2, sum)) * c # rescale
Vqr <- t(Vqr)
Vqn <- t((Vqr - apply(Vqr, 1, min)) / # normalize factor loadings column-wise
(apply(Vqr, 1, max) - apply(Vqr, 1, min)))
# calculate the end-member scores matrix by non-negative least square fitting
Mq <- matrix(nrow = nrow(X), ncol = r)
for (i in 1:nrow(X)) {
Mq[i, ] <- nnls(Vqn, as.vector(t(W[i, ])))$X
}
# Step 6: Rescale
s <- (c - sum(ls[, 1])) / apply(Vqn * unname(ls[, 2] - ls[, 1]), 2, sum)
Vqs <- Vqn
for (i in 1:r) {
Vqs[, i] <- t(s[i] * t(Vqn[, i]) * (ls[, 2] - ls[, 1]) + ls[, 1])
}
Vqsn <- t(t(Vqs) / apply(Vqs, 2, sum)) * c # rescale loading matrix
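A quick sanity check after rescaling: each end member should again be a composition, i.e. non-negative loadings that sum to the chosen constant. The 3x2 matrix below is a made-up stand-in for Vqsn:

```r
c_sum <- 1 # the constant sum used in the closure step
# made-up 3x2 loading matrix standing in for Vqsn (columns = end members)
Vqsn_demo <- matrix(c(0.2, 0.5, 0.3,
                      0.6, 0.1, 0.3), ncol = 2)
ok <- all(abs(colSums(Vqsn_demo) - c_sum) < 1e-12) && all(Vqsn_demo >= 0)
ok # TRUE
```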
# Visualisation of the synthetic prototypes:
library(RColorBrewer)
par(mfrow = c(1, 1))
cols <- brewer.pal(r, "Set2")
## Modeled data ##
plot(phi, Vqsn[, 1],
type = "l",
ylab = "Volume proportion",
main = "Modeled loadings",
col = cols[1], ylim = c(0, 0.10), lwd = 2
)
lines(phi, Vqsn[, 2], col = cols[2], lwd = 2)
lines(phi, Vqsn[, 3], col = cols[3], lwd = 2)
lines(phi, Vqsn[, 4], col = cols[4], lwd = 2)
## natural end members (EMa comes with the loaded RData file) ##
lines(phi, EMa[[1]], col = "grey", lwd = 2)
lines(phi, EMa[[2]], col = "grey", lwd = 2)
lines(phi, EMa[[3]], col = "grey", lwd = 2)
lines(phi, EMa[[4]], col = "grey", lwd = 2)
legend("topright",
legend = c("EM 1", "EM 2", "EM 3", "EM 4", "natural end members"),
lwd = 2,
col = c(cols, "grey"),
cex = 0.85
)
The result underestimates the natural processes (which is to be expected with a linear fit), but clearly identifies them. The simulated data has a lot of advantages:
- known number of processes
- no noise
- equally distributed processes
- stable processes across all observations
We should be able to reproduce that with an iterative process to identify the optimal number of processes. Additionally, we need an idea about the separation of signal and noise and their relation to the elements in the data! There should also be an argument for dropping the rescaling steps when working within the compositional framework. We will compare both ways.
ToDo
A clr coordinate contains the relative information of one part related to all the others. I am not a geologist, but whether that is what we are looking for might depend on the process of sedimentation. And to be honest, I have no idea about that one.
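For reference, a clr coordinate is the log-ratio of a part to the geometric mean of all parts; the toy composition below (with illustrative element names) shows this by hand, without the compositions package:

```r
# toy composition (parts sum to 1; element names are purely illustrative)
x <- c(Al = 0.5, Ti = 0.3, Ca = 0.2)
# clr: log of each part relative to the geometric mean of all parts
clr_x <- log(x) - mean(log(x))
sum(clr_x) # clr coordinates always sum to (numerically) zero
```

The zero-sum constraint is why clr coordinates are singular and why ilr coordinates are often preferred for modeling.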
# this function is an artifact of the translation of the Matlab plot functions and requires the read script to run beforehand
graph_ts_200ka(display = TRUE, datasets = c("data_odp_967_22", "data_odp721_722_terr", "data_odp_709", "data_icdp_chb", "data_kl09", "data_kl11", "data_kl15", "data_lake_tana"), index = 1)
ToDo: It would be more informative to have point data instead of polygons.
The first step is to reproduce the cross correlation analysis from the paper (especially figure 3).
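As a minimal sketch of a cross-correlation analysis (not the paper's actual method), base R's ccf() can recover a known lag between two synthetic series:

```r
# two synthetic, equally spaced series standing in for the proxy records;
# b is a copy of a delayed by 5 sampling steps (purely illustrative)
t <- 0:999
a <- sin(2 * pi * t / 20)
b <- sin(2 * pi * (t - 5) / 20)
cc <- ccf(a, b, lag.max = 20, plot = FALSE)
# lag (in sampling steps) with the strongest correlation
cc$lag[which.max(cc$acf)]
```

By R's convention, ccf(a, b) at lag k estimates cor(a[t + k], b[t]), so a delay of b relative to a shows up as a negative lag.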
The following code reads data objects from Matlab (.mat) files.
# Install the R.matlab package if it is not already installed
if (!requireNamespace("R.matlab", quietly = TRUE)) {
  install.packages("R.matlab")
}
# Load the R.matlab package
library(R.matlab)
# Use the readMat() function to read a .mat file
data <- readMat("data/Africa_NE_200/data_pentagramm_5_1200_025kyr_XCorr.mat")
Visualize the status of the targets pipeline:
tar_visnetwork()